Transient-fault handling

What is fault tolerance?

Fault tolerance is the ability of a system to continue functioning even when some of its components or processes fail. It is a design principle that seeks to ensure that a system can provide continuous service in the event of hardware or software failures.

In a fault-tolerant system, failures are anticipated and accounted for, and the system is designed to handle them without causing a complete shutdown or loss of data. This is typically achieved through redundancy, where multiple instances of critical components are deployed, allowing the system to continue operating even if some instances fail.

Transient faults

A transient fault is a temporary and intermittent failure that occurs in a system or application due to environmental or operational factors, such as network connectivity issues, server overload, or database timeouts. Transient faults are typically short-lived and resolve themselves without any intervention.

Transient faults are different from permanent faults, which are caused by hardware or software failures that require repair or replacement. Permanent faults typically require manual intervention to resolve, whereas transient faults may resolve themselves spontaneously.

gRPC calls can be interrupted by transient faults such as:

Momentary loss of network connectivity.
Temporary unavailability of a service.
An overloaded server that cannot process any more requests.

When a gRPC call is interrupted, the client throws an RpcException with details about the error. The client app must catch the exception and choose how to handle the error.

Transient faults are often handled using retry logic or other fault tolerance mechanisms.

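using System;
using Grpc.Core;
using Grpc.Net.Client;

// Placeholder address (an assumption); point this at your own gRPC server.
using var channel = GrpcChannel.ForAddress("https://localhost:5001");
var client = new Greeter.GreeterClient(channel);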
try
{
    var response = await client.SayHelloAsync(
        new HelloRequest { Name = ".NET" });

    Console.WriteLine("From server: " + response.Message);
}
catch (RpcException ex)
{
    // Write logic to inspect the error and retry
    // if the error is from a transient fault.
}
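
One way to fill in the catch block above is sketched below. It assumes that only the Unavailable and ResourceExhausted status codes count as transient, which is a choice made for this example; the right set depends on the service and on whether the call is safe to repeat. IsTransient and CallWithRetryAsync are hypothetical helper names, and the demo uses a simulated call instead of a real server.

```csharp
using System;
using System.Threading.Tasks;
using Grpc.Core;

// Assumption for this sketch: only these status codes are treated as transient.
bool IsTransient(RpcException ex) =>
    ex.StatusCode == StatusCode.Unavailable ||
    ex.StatusCode == StatusCode.ResourceExhausted;

// Hypothetical helper: runs `call` and retries transient failures,
// making at most `maxAttempts` attempts in total.
async Task<T> CallWithRetryAsync<T>(Func<Task<T>> call, int maxAttempts, TimeSpan pause)
{
    for (int attempt = 1; ; attempt++)
    {
        try
        {
            return await call();
        }
        // The filter lets non-transient errors, and the final failed attempt,
        // propagate to the caller unchanged.
        catch (RpcException ex) when (attempt < maxAttempts && IsTransient(ex))
        {
            Console.WriteLine($"Attempt {attempt} failed with {ex.StatusCode}; retrying.");
            await Task.Delay(pause);
        }
    }
}

// Demo without a server: a simulated call that fails twice, then succeeds.
int failuresLeft = 2;
string reply = await CallWithRetryAsync(() =>
{
    if (failuresLeft-- > 0)
        throw new RpcException(new Status(StatusCode.Unavailable, "simulated outage"));
    return Task.FromResult("Hello .NET");
}, maxAttempts: 3, pause: TimeSpan.FromMilliseconds(100));
Console.WriteLine(reply);
```

With the real client, the call could be wrapped as CallWithRetryAsync(() => client.SayHelloAsync(new HelloRequest { Name = ".NET" }).ResponseAsync, 3, TimeSpan.FromSeconds(1)).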

Treating transient faults

Handling a transient fault means having a strategy for when and how to retry. If the server is overloaded, it is better to give it a short time to finish what is already in its queue than to repeat the request again and again, 10 million times. Hammering the server will not change the result.
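
A common way to "give the server time" is exponential backoff: wait longer after each failed attempt, up to a cap, and randomize the wait so that many clients do not retry at the same moment. The sketch below only computes the delays. BackoffDelay and WithJitter are hypothetical helper names, and the one-second start, ten-second cap, and multiplier of 2 are illustrative values, not recommendations.

```csharp
using System;

// Hypothetical helper: delay before retry number `attempt` (1-based),
// computed as initial * multiplier^(attempt - 1) and capped at `max`.
TimeSpan BackoffDelay(int attempt, TimeSpan initial, TimeSpan max, double multiplier = 2.0)
{
    double ms = initial.TotalMilliseconds * Math.Pow(multiplier, attempt - 1);
    return TimeSpan.FromMilliseconds(Math.Min(ms, max.TotalMilliseconds));
}

// Hypothetical helper: "full jitter" picks a random wait in [0, delay).
TimeSpan WithJitter(TimeSpan delay, Random rng) =>
    TimeSpan.FromMilliseconds(delay.TotalMilliseconds * rng.NextDouble());

// Print the delay ceiling and one jittered sample for the first five retries.
var jitterSource = new Random();
for (int n = 1; n <= 5; n++)
{
    TimeSpan ceiling = BackoffDelay(n, TimeSpan.FromSeconds(1), TimeSpan.FromSeconds(10));
    TimeSpan sample = WithJitter(ceiling, jitterSource);
    Console.WriteLine($"retry {n}: wait up to {ceiling.TotalSeconds}s, for example {sample.TotalSeconds:F2}s");
}
```

The jittered value is what would be passed to Task.Delay between attempts.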

Duplicating the retry logic, and whatever algorithm you want it to follow, at every call site is verbose, time-consuming, and error-prone. This is reason enough to find a global way to apply the same policy to all calls.
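
One option for a global policy, assuming the Grpc.Net.Client package, is to attach a retry policy to the channel through its service configuration, so that every client created from that channel inherits it. The sketch below is a configuration example only. The address is a placeholder, and the attempt count, backoff values, and choice of retryable status codes are illustrative.

```csharp
using System;
using Grpc.Core;
using Grpc.Net.Client;
using Grpc.Net.Client.Configuration;

// MethodName.Default applies this configuration to every method
// called through the channel.
var defaultMethodConfig = new MethodConfig
{
    Names = { MethodName.Default },
    RetryPolicy = new RetryPolicy
    {
        MaxAttempts = 5,                            // total attempts, including the first call
        InitialBackoff = TimeSpan.FromSeconds(1),   // wait before the first retry
        MaxBackoff = TimeSpan.FromSeconds(5),       // upper bound for the backoff
        BackoffMultiplier = 1.5,                    // growth factor between retries
        RetryableStatusCodes = { StatusCode.Unavailable }
    }
};

// Placeholder address (an assumption); replace it with your server's address.
using var retryChannel = GrpcChannel.ForAddress("https://localhost:5001", new GrpcChannelOptions
{
    ServiceConfig = new ServiceConfig { MethodConfigs = { defaultMethodConfig } }
});
```

A client such as new Greeter.GreeterClient(retryChannel) would then retry calls that fail with Unavailable, without any try/catch retry code at the call site.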

Other resources

Retries - best practices for cloud applications